
Introduction

All provided data is treated as ground truth. For example, some “home” games were actually played at secondary locations (including TOR’s entire 2020-21 season); these are not reflected in the data, and I was instructed not to account for them. Note that the OKC and DEN 2024-25 schedules in schedule_24_partial.csv intentionally include only 80 games each: the league holds 2 games out for each team in mid-December due to unknown NBA Cup matchups, and I was instructed not to assign specific games to fill those two slots.

Note:

Throughout this document, any season column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. We may refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).
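As a quick illustration of this convention, a full season label reduces to its start year (a hypothetical helper for illustration only; the datasets already store seasons this way):

```r
# Convert a full season label to the start-year encoding used in the data,
# e.g. "2015-16" -> 2015
season_start <- function(label) as.integer(sub("-.*", "", label))

season_start("2015-16")  # 2015
```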

Answers

Part 1

Question 1: 26 4-in-6 stretches in OKC’s draft schedule.

Question 2: 24.6 4-in-6 stretches on average.

Question 3:

  • Most 4-in-6 stretches on average: CHA (27.7)
  • Fewest 4-in-6 stretches on average: NOP (21.8)

Question 4: Written response; see the discussion under Question 4 below.

Question 5:

  • BKN Defensive eFG%: 54.3%
  • When opponent on a B2B: 53.5%

Part 2

Work is shown in the document.

Part 3

Question 9:

  • Most Helped by Schedule: MIL (6.0 wins)
  • Most Hurt by Schedule: DET (-6.4 wins)

Setup and Data

library(tidyverse)

schedule <- read_csv("C:/Users/elias/Downloads/Technical Project OKC/schedule.csv")
draft_schedule <- read_csv("C:/Users/elias/Downloads/Technical Project OKC/schedule_24_partial.csv")
locations <- read_csv("C:/Users/elias/Downloads/Technical Project OKC/locations.csv")
game_data <- read_csv("C:/Users/elias/Downloads/Technical Project OKC/team_game_data.csv")

Part 1 – Schedule Analysis

This section answers a series of questions using NBA scheduling data.

Question 1

QUESTION: How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (Clarification: the stretches can overlap; the question is really “for how many games is that game the 4th played over the past 6 nights?”)

Number of 4-in-6 Stretches

# First filter out the Nuggets games to focus on the Thunder's schedule.
library(dplyr)
okc_schedule <- draft_schedule |>
  filter(team == "OKC")

# Compare the dates of the first and last games of every 4-game window.
# The schedule is sorted newest-first, so subtracting the date of the game
# 3 rows ahead gives a positive gap; count the windows where that gap is
# exactly 5 days. These are the 4-in-6 stretches.
okc_schedule |>
  mutate(num_days = gamedate - lead(gamedate, n = 3)) |>
  summarize("Number of 4-in-6 Stretches" = sum(num_days == 5, na.rm = TRUE))
## # A tibble: 1 × 1
##   `Number of 4-in-6 Stretches`
##                          <int>
## 1                           26

ANSWER 1:

26 4-in-6 stretches in OKC’s draft schedule.
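The window logic above can be sanity-checked on a tiny hand-made schedule (hypothetical dates, sorted newest-first to match the subtraction direction used above):

```r
library(dplyr)

# Five hypothetical game dates, newest first
toy <- tibble(gamedate = as.Date(c("2024-11-08", "2024-11-06",
                                   "2024-11-05", "2024-11-03",
                                   "2024-11-01")))

# A game is the 4th in 6 nights when it falls exactly 5 days after the
# game three rows below it; here Nov 3-8 and Nov 1-6 both qualify
toy |>
  mutate(num_days = as.numeric(gamedate - lead(gamedate, n = 3))) |>
  summarize(n_4in6 = sum(num_days == 5, na.rm = TRUE))  # 2 stretches
```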

Question 2

QUESTION: From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? Adjust each team/season to per-82 games before taking your final average.

# Note: the prompt asks only about 4-in-6 stretches, so the tighter 4-in-5
# stretches that existed before 2018 are not counted (the code requires a
# gap of exactly 5 days).

# Count each team's total 4-in-6 stretches, scale to per-82 games, then take
# the mean of the per-team rates.
schedule |>
  group_by(team) |>
  mutate(num_days = gamedate - lead(gamedate, n = 3)) |>
  summarize(avg_4in6 = sum(num_days == 5, na.rm = TRUE) / (length(gamedate) / 82)) |>
  summarize("Average Number of 4-in-6 Stretches for a Team in Season" = mean(avg_4in6))
## # A tibble: 1 × 1
##   `Average Number of 4-in-6 Stretches for a Team in Season`
##                                                       <dbl>
## 1                                                      24.6

ANSWER 2:

24.6 4-in-6 stretches on average.
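The per-82 adjustment is simple arithmetic; for instance, a team with 20 stretches in a 72-game season (as in the shortened 2019-20 or 2020-21 seasons) scales up as follows (made-up numbers for illustration):

```r
# Per-82 scaling: observed stretches divided by the fraction of 82 games played
stretches <- 20
games <- 72
stretches / (games / 82)  # ~22.8 stretches per 82 games
```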

Question 3

QUESTION: Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games.

# Reuse code from Q2 for each team's average.
team_averages <- schedule |>
  group_by(team) |>
  mutate(num_days = gamedate - lead(gamedate, n = 3)) |>
  summarize(avg_4in6 = sum(num_days == 5, na.rm = TRUE) / (length(gamedate) / 82))
 
# Find the highest and lowest averages and the corresponding teams.
# Paste teams and values into one string for cleaner printing,
# rounding both averages to the same number of digits.
team_averages |>
  summarize("Highest Average Team" = 
              paste(team[which.max(avg_4in6)], 
                    " (", round(max(avg_4in6), digits = 4), ")", sep = ""), 
            "Lowest Average Team" = 
              paste(team[which.min(avg_4in6)], 
                    " (", round(min(avg_4in6), digits = 4), ")", sep = ""))
## # A tibble: 1 × 2
##   `Highest Average Team` `Lowest Average Team`
##   <chr>                  <chr>                
## 1 CHA (27.7125)          NOP (21.8325)

ANSWER 3:

  • Most 4-in-6 stretches on average: CHA (27.7)
  • Fewest 4-in-6 stretches on average: NOP (21.8)

Question 4

QUESTION: Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?

# Check the normality of the distribution
# using a histogram and summary statistics
averages <- team_averages$avg_4in6
hist(averages, breaks = 8)

summary(averages)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.83   23.49   24.81   24.58   25.72   27.71
# Calculate z-scores
z_score_max <- max(scale(averages))
z_score_max
## [1] 2.04838
z_score_min <- min(scale(averages))
z_score_min
## [1] -1.792808
# Use IQR Method to test if max and min appear to be outliers
q1 <- quantile(averages, 0.25)
q3 <- quantile(averages, 0.75)
iqr <- q3 - q1

lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
as.logical(min(averages) > lower_bound & max(averages) < upper_bound)
## [1] TRUE
# Use Grubbs' Test to test likelihood that max and min are outliers
library(outliers)
# Grubbs' Test for maximum
grubbs.test(averages, type = 10)
## 
##  Grubbs test for one outlier
## 
## data:  averages
## G = 2.04838, U = 0.85033, p-value = 0.5203
## alternative hypothesis: highest value 27.7124842370744 is an outlier
# Grubbs' Test for minimum
grubbs.test(averages, type = 10, opposite = TRUE)
## 
##  Grubbs test for one outlier
## 
## data:  averages
## G = 1.79281, U = 0.88535, p-value = 1
## alternative hypothesis: lowest value 21.8325 is an outlier

ANSWER 4:

The difference between the most and least from Q3 is not surprising; a gap of that size is likely the result of chance. After confirming the distribution is approximately normal, z-scores showed that the maximum and minimum were each about 2 standard deviations from the mean, well within the range expected under a normal distribution. The IQR method likewise placed both the maximum and minimum inside the expected range, so neither would be flagged as an outlier. Finally, Grubbs’ Test returned high p-values for both the maximum and the minimum, meaning neither is likely to be an outlier. All three methods agree that the size of the gap between the most and least from Q3 is plausibly the result of chance.

Question 5

QUESTION: What was BKN’s defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?

# Create a function to calculate eFG%
# Inputs: field goals made, 3 point field goals made, field goals attempted
# Outputs: eFG%
e_fg <- function(fgm, fgm3, fga) {
  efg <- (fgm + (0.5 * fgm3)) / fga
  return(efg)
}

# For BKN defensive eFG% filter to defensive entries for the correct team and season.
# Then sum the appropriate columns and enter the totals eFG% function.
# Multiply by 100 to convert from decimal to percentage.
game_data |>
  filter(def_team == "BKN" & season == 2023) |>
  summarize("BKN Defensive eFG%" = e_fg(sum(fgmade), sum(fg3made), sum(fgattempted)) * 100)
## # A tibble: 1 × 1
##   `BKN Defensive eFG%`
##                  <dbl>
## 1                 54.3
# Separate out each team's schedule and arrange by game date. 
# Filter to games one day after the previous game where the opponent is BKN.
# Use eFG% function and convert from decimal to percentage.
game_data |>
  group_by(off_team) |>
  arrange(gamedate) |>
  filter(season == 2023 & gamedate - lag(gamedate) == 1 & def_team == "BKN") |>
  ungroup() |>
  summarize("BKN Defensive eFG% (Opponent on Second Night of Back-to-Back)" = 
              e_fg(sum(fgmade), sum(fg3made), sum(fgattempted)) * 100)
## # A tibble: 1 × 1
##   `BKN Defensive eFG% (Opponent on Second Night of Back-to-Back)`
##                                                             <dbl>
## 1                                                            53.5

ANSWER 5:

  • BKN Defensive eFG%: 54.3%
  • When opponent on a B2B: 53.5%
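As a sanity check on the e_fg() helper defined above, here is a hand-verifiable example with a made-up stat line:

```r
# Same eFG% formula as above: (FGM + 0.5 * 3PM) / FGA
e_fg <- function(fgm, fgm3, fga) {
  (fgm + (0.5 * fgm3)) / fga
}

# Hypothetical game: 40 FGM, 12 of them threes, on 85 FGA
e_fg(40, 12, 85) * 100  # (40 + 6) / 85 * 100 = ~54.1
```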

Part 3 – Modeling

Question 9

QUESTION: Please estimate how many more/fewer regular season wins each team has had due to schedule-related factors from 2019-20 through 2023-24. Your final answer should have one number for each team, representing the total number of wins (not per 82, and not a per-season average). You may consider the on-court strength of the scheduled opponents as well as the impact of travel/schedule density. Please include the teams and estimates for the most helped and most hurt in the answer key.

If you fit a model to help answer this question, please write a paragraph explaining your model, and include a simple model diagnostic (e.g., a printed summary of a regression, a variable importance plot, etc.).

# Modify the schedule to include longitude, latitude, days between games,
# distance between games, and whether the game is the second leg of a
# back-to-back. Handle NA values.
library(geosphere)  # provides distHaversine()

enhanced_schedule <- schedule |>
  group_by(season, team, opponent) |>
  
  # Longitude and Latitude
  mutate(longitude = if_else(home == 0, locations$longitude[locations$team == opponent[1]], 
                             locations$longitude[locations$team == team[1]]),
         latitude = if_else(home == 0, locations$latitude[locations$team == opponent[1]], 
                            locations$latitude[locations$team == team[1]])) |>
  ungroup(opponent) |>
  
  mutate(# Days between games (NA for a team's first game in the data;
         # substitute 10 days, i.e. fully rested)
         days_between = gamedate - lead(gamedate), 
         days_between = replace(days_between, is.na(days_between), 10),
         
         # Distance between Games
         prev_longitude = lead(longitude),
         prev_latitude = lead(latitude),
         distance = distHaversine(pick(prev_longitude, prev_latitude), pick(longitude, latitude)) / 1609.344,
         distance = replace(distance, is.na(distance), 0),
         is_b2b = gamedate - lead(gamedate) == 1,
         is_b2b = replace(is_b2b, is.na(is_b2b), FALSE)) |>
  
  # Remove extraneous columns and sort by date
  select(-prev_longitude, -prev_latitude) |>
  arrange(gamedate)
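The mileage computed above relies on geosphere::distHaversine(); as an illustration of what that call returns, here is a hand-rolled haversine (same default Earth radius) with approximate, assumed coordinates for Oklahoma City and Denver:

```r
# Great-circle distance in miles, mirroring distHaversine() (which returns
# meters; the pipeline above divides by 1609.344, as done here)
haversine_miles <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a)) / 1609.344
}

# Approximate (assumed) coordinates for Oklahoma City and Denver
haversine_miles(-97.52, 35.47, -104.99, 39.74)  # roughly 500 miles
```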

# Create temporary schedule to facilitate opponent net rating calculation
temp_schedule <- enhanced_schedule |>
  group_by(season, team, gamedate) |>
  
  # Add points and possessions as columns. For both offense and defense
  mutate(points = game_data$points[game_data$gamedate == gamedate & game_data$off_team == team],
         possessions = game_data$possessions[game_data$gamedate == gamedate & 
                                               game_data$off_team == team], 
         defpoints = game_data$points[game_data$gamedate == gamedate & 
                                        game_data$off_team == opponent],
         defpossessions = game_data$possessions[game_data$gamedate == gamedate & 
                                                  game_data$off_team == opponent]) |>
  ungroup(gamedate) |>
  
  # Compute season-to-date net rating, updated after every game
  mutate(ortg = cumsum(points) / (cumsum(possessions) / 100), 
         drtg = cumsum(defpoints) / (cumsum(defpossessions) / 100),
         nrtg = ortg - drtg) |>
  ungroup()

# Add opponent's net rating for the season to each game  
enhanced_schedule <- enhanced_schedule |>
  left_join(
    temp_schedule |> 
      select(gamedate, team = opponent, opp_nrtg = nrtg),
    by = c("gamedate", "team")
  )


# Examine correlations with a correlation matrix
library(corrplot)
## corrplot 0.95 loaded
data_filtered <- enhanced_schedule |> 
  ungroup() |>
  select(win, home, days_between, distance, is_b2b, opp_nrtg)
 
data_filtered  <- data_filtered |>
  mutate(across(everything(), as.numeric))
 
correlation_matrix <- cor(data_filtered)

corrplot(correlation_matrix, method = "color", type = "full",
         addCoef.col = "black",
         number.cex = 0.7)

# Drop days_between from the model, since it had the weakest correlation with winning
schedule_model <- glm(win ~ home + distance + is_b2b + opp_nrtg, data = enhanced_schedule, family = binomial)
  
# Filter to the requested seasons (2019-20 through 2023-24)
enhanced_schedule <- enhanced_schedule |>
  filter(season >= 2019 & season <= 2023) |>
  ungroup(season)


# Now calculate number of wins gained and lost for each team.

schedule_vars <- c("home", "distance", "is_b2b", "opp_nrtg")

# Find average values for each variable
baseline_vals <- enhanced_schedule |>
  summarize(across(all_of(schedule_vars), ~ mean(.x, na.rm = TRUE)))

# Counterfactual dataset with neutral schedule
schedule_cf <- enhanced_schedule
for (i in schedule_vars) {
  schedule_cf[[i]] <- baseline_vals[[i]][1]
}
schedule_cf$is_b2b <- FALSE  # a neutral schedule has no back-to-backs

# Use the model on both datasets
pred_actual <- predict(schedule_model, newdata = enhanced_schedule, type = "response")
pred_cf     <- predict(schedule_model, newdata = schedule_cf, type = "response")

# Take the per-game difference between the two predicted win probabilities
results <- enhanced_schedule |>
  ungroup() |>
  mutate(pred_actual = pred_actual,
         pred_cf = pred_cf,
         diff = pred_actual - pred_cf) |>
  group_by(team) |>
  summarise(
    expected_wins_actual = sum(pred_actual, na.rm = TRUE),
    expected_wins_cf     = sum(pred_cf, na.rm = TRUE),
    schedule_effect      = sum(diff, na.rm = TRUE)
  ) |>
  ungroup() |>
  mutate(
    schedule_effect = schedule_effect - mean(schedule_effect, na.rm = TRUE)
  ) |>
  arrange(desc(schedule_effect))

print(results)
## # A tibble: 30 × 4
##    team  expected_wins_actual expected_wins_cf schedule_effect
##    <chr>                <dbl>            <dbl>           <dbl>
##  1 MIL                   201.             201.            5.95
##  2 LAC                   200.             200.            5.36
##  3 BOS                   199.             200.            4.42
##  4 UTA                   198.             200.            3.61
##  5 PHX                   199.             201.            3.54
##  6 DEN                   198.             201.            2.99
##  7 DAL                   199.             202.            2.79
##  8 MIA                   198.             201.            2.62
##  9 BKN                   197.             200.            1.91
## 10 TOR                   196.             200.            1.63
## # ℹ 20 more rows
# Print top and bottom 3

cat("\nMost helped:\n")
## 
## Most helped:
print(slice_max(results, schedule_effect, n = 3) |>
        select(team, schedule_effect))
## # A tibble: 3 × 2
##   team  schedule_effect
##   <chr>           <dbl>
## 1 MIL              5.95
## 2 LAC              5.36
## 3 BOS              4.42
cat("\nMost hurt:\n")
## 
## Most hurt:
print(slice_min(results, schedule_effect, n = 3) |>
        select(team, schedule_effect))
## # A tibble: 3 × 2
##   team  schedule_effect
##   <chr>           <dbl>
## 1 DET             -6.37
## 2 SAS             -5.73
## 3 ORL             -5.71

ANSWER 9:

  • Most Helped by Schedule: MIL (6.0 wins)
  • Most Hurt by Schedule: DET (-6.4 wins)

Model Explanation (refer to the correlation map above): The model is a logistic regression that predicts each game’s win probability from four inputs: whether the game was at home, whether it was the second night of a back-to-back, how far the team travelled to reach the game, and the opponent’s season-to-date net rating. Days between games was dropped because it showed the weakest correlation with winning. To isolate schedule effects, every game is predicted twice, once with its actual schedule variables and once under a neutral (league-average) schedule with no back-to-backs; summing the per-game differences in win probability and centering across the league gives each team’s estimated wins gained or lost to the schedule.
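The counterfactual logic can be sketched end-to-end on synthetic data (all numbers below are made up; this mirrors the structure of the analysis, not its results):

```r
# Fit a logistic win model on fake games, then predict each game twice:
# once with actual schedule variables, once with a neutral (league-average)
# schedule. The summed probability differences estimate wins gained/lost.
set.seed(1)
n  <- 1000
df <- data.frame(home = rbinom(n, 1, 0.5), opp_nrtg = rnorm(n, 0, 5))
df$win <- rbinom(n, 1, plogis(0.4 * df$home - 0.05 * df$opp_nrtg))

m <- glm(win ~ home + opp_nrtg, data = df, family = binomial)

cf <- df
cf$home     <- mean(df$home)      # neutralize home-court advantage
cf$opp_nrtg <- mean(df$opp_nrtg)  # league-average opponent strength

wins_effect <- sum(predict(m, newdata = df, type = "response") -
                   predict(m, newdata = cf, type = "response"))
```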